server: bench: minor fixes #10765

Merged 2 commits into master on Jan 2, 2025
Conversation

@phymbert (Collaborator) commented Dec 10, 2024

Context

After a nice exchange with @ngxson, this is a minor change to the current server bench framework to refresh it a bit. The longer-term target, still to be assessed, is to replace k6/xk6-sse with a Python-based tool such as Locust.

Changes

  • support the OpenAI streaming standard output terminated by [DONE]\n\n (see the sketch after this list)
  • export k6 raw results in CSV
  • fix too many idle TCP connections in tcp_wait
  • add a metric for the time to emit the first token
  • wait for the server to be ready in the CI script
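
For reference, a minimal Python sketch (not part of the bench harness; the URL and payload are assumed for illustration) of how an OpenAI-style SSE stream is terminated by a final data: [DONE] event and how the time to emit the first token can be measured:

    import json
    import time

    import requests  # assumed dependency, for illustration only

    def stream_completion(url, payload):
        start = time.time()
        first_token_s = None
        with requests.post(url, json=payload, stream=True) as resp:
            for line in resp.iter_lines(decode_unicode=True):
                # SSE data lines look like "data: {...}"; skip blanks and keep-alives
                if not line or not line.startswith("data: "):
                    continue
                data = line[len("data: "):]
                if data == "[DONE]":  # OpenAI streaming terminator
                    break
                chunk = json.loads(data)
                if first_token_s is None and chunk.get("choices"):
                    first_token_s = time.time() - start  # time to emit first token
        return first_token_s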

Tests (phi2 on RTX 3050)

LLAMA_SERVER_BIN_PATH=../../../cmake-build-debug/bin/llama-server python bench.py \
              --runner-label local \
              --name local \
              --branch `git rev-parse --abbrev-ref HEAD` \
              --commit `git rev-parse HEAD` \
              --scenario script.js \
              --duration 5m \
              --hf-repo ggml-org/models \
              --hf-file phi-2/ggml-model-q4_0.gguf \
              --model-path-prefix models \
              --parallel 4 \
              -ngl 33 \
              --batch-size 2048 \
              --ubatch-size 256 \
              --ctx-size 4096 \
              --n-prompts 200 \
              --max-prompt-tokens 256 \
              --max-tokens 256
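
Note: the raw k6 results mentioned in the changes are exported in CSV; when running k6 directly, this corresponds to its built-in CSV output, e.g. (file name illustrative):

    k6 run --out csv=raw_results.csv script.js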

Results:

srv  update_slots: all slots are idle
request: POST /v1/chat/completions 127.0.0.1 200

     ✓ success completion

     checks.....................................: 100.00% 165 out of 165
     data_received..............................: 4.5 MB  15 kB/s
     data_sent..................................: 96 kB   306 B/s
     dropped_iterations.........................: 35      0.111853/s
     http_req_duration..........................: avg=7.1s       min=794.72ms   med=4.13s      max=30.06s     p(90)=17.87s     p(95)=26.41s    
     http_req_sending...........................: avg=4.82ms     min=2.38ms     med=3.71ms     max=22.05ms    p(90)=7.24ms     p(95)=11.5ms    
     http_reqs..................................: 165     0.527306/s
     iteration_duration.........................: avg=7.4s       min=1.09s      med=4.43s      max=30.36s     p(90)=18.17s     p(95)=26.71s    
     iterations.................................: 165     0.527306/s
     llamacpp_completion_tokens.................: avg=126.915152 min=12         med=73         max=512        p(90)=327.6      p(95)=462.2     
     llamacpp_completion_tokens_total_counter...: 20941   66.923093/s
     llamacpp_completions_stop_rate.............: 95.75%  158 out of 165
   ✓ llamacpp_completions_truncated_rate........: 4.24%   7 out of 165
     llamacpp_emit_first_token_second...........: avg=0.161764   min=0.081      med=0.135      max=0.673      p(90)=0.2592     p(95)=0.3168    
     llamacpp_prompt_processing_second..........: avg=575.677944 min=100.149477 med=575.471698 max=877.862595 p(90)=724.111718 p(95)=761.987131
     llamacpp_prompt_tokens.....................: avg=91.824242  min=57         med=71         max=473        p(90)=148        p(95)=215.6     
     llamacpp_prompt_tokens_total_counter.......: 15151   48.419454/s
     llamacpp_tokens_second.....................: avg=18.933054  min=16.649324  med=18.722467  max=22.551929  p(90)=20.668266  p(95)=21.073767 
     sse_event..................................: 21270   67.974509/s
     vus........................................: 1       min=1          max=4
     vus_max....................................: 4       min=4          max=4


running (5m12.9s), 0/4 VUs, 165 complete and 0 interrupted iterations
default ✗ [==============================>-------] 4 VUs  5m12.9s/5m0s  165/200 shared iters
bench: shutting down server pid=7822 ...

@phymbert added the performance (Speed related topics) and server labels Dec 10, 2024
@github-actions bot added the examples and python (python script changes) labels Dec 10, 2024
- fix the case where Prometheus is not started
- wait for the server to be ready before starting the bench
@phymbert removed the examples and python (python script changes) labels Dec 27, 2024
@phymbert marked this pull request as ready for review December 27, 2024 10:11
@phymbert requested a review from ngxson as a code owner December 27, 2024 10:11
@ngxson (Collaborator) left a comment

I don't have the hardware to test, but LGTM.

Though, I'm looking forward to migrating to a Python solution like Locust (as mentioned in the PR description). That could simplify the installation process a lot while giving much more flexibility in the script (ideally, we would only need a single bench.py script in the future that can do everything at once).
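
For context, a minimal Locust sketch of what such a Python-based scenario might look like (endpoint and payload are illustrative, not an agreed design):

    from locust import HttpUser, task, between

    class CompletionUser(HttpUser):
        wait_time = between(0.1, 1.0)  # think time between iterations

        @task
        def chat_completion(self):
            # illustrative request; a real scenario would mirror script.js
            self.client.post("/v1/chat/completions", json={
                "model": "model",
                "messages": [{"role": "user", "content": "Hello"}],
                "max_tokens": 256,
            })

It would be run with something like locust -f bench_locust.py --host http://localhost:8080, with the file name and host as placeholders.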

@@ -89,6 +90,9 @@ export default function () {
],
"model": model,
"stream": true,
"stream_options": {
"include_usage": true, // False to be supported in llama.cpp server
A collaborator replied on this diff:

Not sure what you mean here, but in llama.cpp we ignore include_usage and always include the usage info.
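
For reference, when usage info is included, it typically arrives as a final chunk right before the stream terminator, along these lines (field values illustrative):

    data: {"choices":[],"usage":{"prompt_tokens":92,"completion_tokens":127,"total_tokens":219}}

    data: [DONE]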

@phymbert merged commit 2f0ee84 into master Jan 2, 2025
9 checks passed
@phymbert deleted the phymbert/server/bench/fix-streaming branch January 2, 2025 17:06